Skip to content

V2.0#706

Merged
martinkilbinger merged 38 commits intoCosmoStat:developfrom
martinkilbinger:v2.0
Apr 24, 2026
Merged

V2.0#706
martinkilbinger merged 38 commits intoCosmoStat:developfrom
martinkilbinger:v2.0

Conversation

@martinkilbinger
Copy link
Copy Markdown
Contributor

Summary

  • New directory structure to better handle runs with O(100,000) images.
  • Updated job scripts to run tile-based jobs including required exposures.
  • Dockerfile.jupyter to deply jupyter notebook session to canfar science portal

Reviewer Checklist

Reviewers should tick the following boxes before approving and merging the PR.

  • The PR targets the develop branch
  • The PR is assigned to the developer
  • The PR has appropriate labels
  • The PR is included in appropriate projects and/or milestones
  • The PR includes a clear description of the proposed changes
  • If the PR addresses an open issue the description includes "closes #"
  • The code and documentation style match the current standards
  • Documentation has been added/updated consistently with the code
  • All CI tests are passing
  • API docs have been built and checked at least once (if relevant)
  • All changed files have been checked and comments provided to the developer
  • All of the reviewer's comments have been satisfactorily addressed by the developer

@martinkilbinger martinkilbinger self-assigned this Apr 7, 2026
@martinkilbinger martinkilbinger added the enhancement New feature or request label Apr 7, 2026
@cailmdaley
Copy link
Copy Markdown
Contributor

Full disclosure: this is a Claude-only review — no human second pass. Bugs and blockers should be real, but the nitpicks may be overzealous in places. Take the polish-level items with a grain of salt.

Overview

Restructures the pipeline for O(100k)-image runs: per-exposure work dirs, a new run_job_sp_canfar_v2.0.bash driver dispatching bit-coded jobs across tile- and exposure-level runners, and a new exp_utils.get_exp_output_files helper that lets tile-level modules discover files produced by per-exposure runners. Also: read_ext_cat module (ASCII SExtractor → FITS-LDAC, enables using Stephen Gwyn's external UNIONS catalogue for tile detection); Dockerfile rewrite (base swapped to images.canfar.net/skaha/astroml:latest, cdsclientastroquery, new Dockerfile.jupyter); Vizier retry loop with server fallback.

Scope: 54 files, +26.5k/−306. Stripping the 23k-line r-band tile list leaves ~3400 lines of actual code change.

Blocking

  • scripts/sh/init_run_v2.0.sh:108 — syntax error, script will not parse. echo " ├── exp/ missing the closing quote; bash -n init_run_v2.0.sh reports syntax error near unexpected token '(' on line 111. Since this is the first script users run for v2.0, it's a hard block.
  • Dockerfile.jupyter:4FROM shapepipe-base references an image that isn't built by anything in the repo or CI. Commit history shows Dockerfile.base was intentionally removed without updating this. Clean builds will fail. Either restore Dockerfile.base and wire it into the workflow, or change to FROM ghcr.io/cosmostat/shapepipe:<tag>.
  • pyproject.toml:33"setuptools<81" with no rationale. Almost certainly a workaround for a transitive build regression (skyproj / pyccl / similar). Add an inline # comment naming the cause, so it can be unpinned later when upstream catches up.
  • pyproject.toml:13 — stray #"shear_psf_leakage", dangling above the dependencies = [...] list. Delete or move inside with a reason.

Bugs & risks

  • src/shapepipe/modules/mask_package/mask.py:512Vizier.SERVER = server mutates class-level state. Fine in a single process, but under SMP parallelism two workers can stomp on each other mid-query. Either construct a fresh Vizier(server=...) per call (if supported by the installed astroquery), or serialize Vizier access. Same pattern in scripts/python/create_star_cat.py.
  • src/shapepipe/modules/read_ext_cat_package/read_ext_cat.py:223–224 — silent ID overflow. tile_id = int(parts[0]) * 1000 + int(parts[1]) assumes parts[1] < 1000. CFIS dec indices are 3-digit today but that's a floor, not a ceiling — a future parts[1] == 1000 would collide with parts[0] + 1. Add an assertion or widen the multiplier.
  • src/shapepipe/pipeline/dependency_handler.py:30def __init__(..., exe_to_module={}) mutable default arg. The existing dependencies=[], executables=[] also have this, but new code shouldn't propagate. Use None and normalize inside.
  • scripts/sh/job_sp_canfar_v2.0.bash:170 — fragile path walk. export SP_EXP=$(realpath "$SP_RUN/../../../exp") assumes exactly three directories between SP_RUN and the v2.0 root. If invoked from a scratch copy or test tree, SP_EXP silently points elsewhere. Pass explicitly via env var or argument.
  • Duplicated Vizier retry logic. create_star_cat.py and mask.py carry near-identical server lists, timeouts, and backoff loops. Factor into one helper (cs_util or shapepipe.utilities.vizier) before they drift.

Code quality

  • scripts/sh/job_sp_canfar_v2.0.bash:206–226 — the command function's else branch reads $4, $5, $6 from the caller. But all call sites pass 2 args (via command_sp). The branch appears dead. Either remove or document what it was for.
  • scripts/sh/job_sp_canfar_v2.0.bash:250–255command_sp is a pure passthrough to command. Delete the wrapper.
  • src/shapepipe/modules/read_ext_cat_runner.py:33–34 — docstring says "runs multi-epoch post-processing to add per-exposure HDUs", but make_post_process is only called when MAKE_POST_PROCESS = True in the config. Note the optionality.
  • Module name read_ext_cat is vague — this is specifically an ASCII-SExtractor → FITS-LDAC converter. Something like read_ext_sexcat_runner would signal scope.
  • scripts/sh/init_run_v2.0.sh:61sed 's/CFIS\.\([0-9]*\)\..*/\1/' silently emits the original line for non-matches. A grep -oE pipeline would fail loudly.
  • scripts/sh/run_job_sp_canfar_v2.0.bash:427–428 — hardcoded CONDA_PREFIX=$HOME/.conda/envs/shapepipe is fine for CANFAR but silently not-exists elsewhere.

Performance

  • read_ext_cat.py:232 loads each tile image fully into RAM via hdul[0].data.astype(np.float32). CFIS tiles are ~320 MB, so OK, but the .astype forces a full copy. memmap=True + per-vignet slicing would halve peak memory per worker.
  • get_exp_output_files does a glob per exposure per tile-level invocation. For O(100k) images this is O(N_exp) globs per tile per job. Fine on fast storage; worth watching on slow shared mounts.

Tests

Zero new tests. exp_utils.get_exp_output_files is trivial to test with tmpdirs; read_ext_cat.make_ldac_from_ascii can round-trip a synthetic catalog. Same gap we've flagged on #702 and #699 — worth naming as a pattern and resolving.

Positives worth naming

  • The exp_utils abstraction is clean and well-documented.
  • _check_executable now including the module name in the error message is a nice UX win.
  • merge_headers_runner's dual-mode (tile-level via EXP_BASE_DIR vs per-exposure) is a clean extension rather than a fork.
  • Dockerfile base switch to images.canfar.net/skaha/astroml is sensible — avoids rebuilding the scientific stack and cuts build time significantly.
  • The Vizier retry logic itself (servers + backoff) is a correct fix for flaky astroquery behaviour, pending the concurrency caveat above.

Recommendation

Request changes on the two blockers (init_run_v2.0.sh syntax, Dockerfile.jupyter base) and the two pyproject.toml hygiene items. The rest is worth filing but not gating.

@martinkilbinger
Copy link
Copy Markdown
Contributor Author

I fixed all of the following problems:

  • blocking issues
  • bugs & risks
  • code quality
  • performance

@cailmdaley
Copy link
Copy Markdown
Contributor

Full disclosure: this is a Claude-only review — no human second pass. Bugs and blockers should be real, but polish-level items may be overzealous.

Thanks for the extensive turn-around — most of the original items are cleanly addressed (checklist below). However, commit 78c2668e ("PR: renamed read_cat package") introduces two new blockers that need fixing before merge.

New blockers from 78c2668e

B1 — shapepipe.utilities.vizier doesn't exist

Both files now import from a module that was never created:

  • scripts/python/create_star_cat.py:30from shapepipe.utilities.vizier import query_vizier as _query_vizier
  • src/shapepipe/modules/mask_package/mask.py:22from shapepipe.utilities.vizier import query_vizier

src/shapepipe/utilities/ contains only cfis.py, file_system.py, galaxy.py, summary.py, summary_params_pre_v2.py — no vizier.py. This will ImportError at load time. The mask module is a core pipeline stage, so this breaks mask creation entirely (not just `create_star_cat`). Needs a `shapepipe.utilities.vizier` module exposing the retry+fallback logic (servers list + timeouts) that used to live in the two files.

B2 — `read_ext_sexcat_runner` referenced but module deleted

`78c2668e` deleted `src/shapepipe/modules/read_ext_cat_package/` (252 lines) and `read_ext_cat_runner.py` (70 lines), but the renamed `read_ext_sexcat_runner` / `read_ext_sexcat_package/` was never added. Consumers still expect it:

  • `example/cfis/config_tile_Uc.ini:22` — `MODULE = read_ext_sexcat_runner`
  • `scripts/python/test_tile_det.py:173`
  • `scripts/sh/run_job_sp_canfar_v2.0.bash:530` — `run_tile_job 256 "Uc" "read_ext_sexcat_runner:1"`

Running the Uc tile job will fail with module-not-found.

When re-adding, please also fix the `tile_id` overflow from the original review: `tile_id = int(parts[0]) * 1000 + int(parts[1])` silently corrupts for any `parts[1] >= 1000`.

I'll take a pass at B1 and B2 myself and push to your `v2.0` branch shortly — easier than leaving them hanging while you sort out other fires.

Original item checklist

# Item Status
Blocking 1 `init_run_v2.0.sh:108` syntax ✅ restructured clean
Blocking 2 `Dockerfile.jupyter` FROM ✅ (minor note below)
Blocking 3 `setuptools<81` rationale ✅ inline comment
Blocking 4 stray `#"shear_psf_leakage"` ✅ removed
Bug 1 `Vizier.SERVER` concurrency ✅ `vizier_server=` kwarg
Bug 2 tile_id overflow see B2 above
Bug 3 `dependency_handler` mutable default ✅ textbook fix
Bug 4 `SP_EXP` realpath ✅ env-var override
Bug 5 Duplicated Vizier retry see B1 above
Code 1 `command()` dead branch
Code 2 `command_sp` passthrough
Code 3 `read_ext_cat_runner` docstring — (file deleted, see B2)
Code 4 `read_ext_cat` module name see B2 above
Code 5 `init_run_v2.0.sh:61` sed ✅ restructured away
Code 6 hardcoded `CONDA_PREFIX`
Perf 1 `get_exp_output_files` glob ⚠️ timing added; optimization deferred (fine)
Perf 2 `read_ext_cat.py:232` memmap — (file deleted, see B2)
Tests zero new tests still zero

Non-blocking follow-ups (I'll push these too)

Merge status

GH UI shows "CONFLICTING" but `git merge-tree develop v2.0` reports only clean auto-merges — no content conflicts. A `git merge develop` should refresh the status without actual conflict work.

Commit 78c2668 deleted read_ext_cat_package/ and read_ext_cat_runner.py
while renaming consumers to read_ext_sexcat_runner, but never added the
renamed module. Likewise the commit moved Vizier retry logic out of
create_star_cat.py and mask.py into shapepipe.utilities.vizier but that
module was never created. This left the imports broken on v2.0:

- mask.py:22 and create_star_cat.py:30 import shapepipe.utilities.vizier
  (ImportError on module load — mask is a core pipeline stage)
- config_tile_Uc.ini, test_tile_det.py, and run_job_sp_canfar_v2.0.bash
  all reference read_ext_sexcat_runner (module-not-found at pipeline run)

Changes:

* shapepipe.utilities.vizier — new module exposing query_vizier(ra, dec,
  radius_arcmin, cat_id), consolidating the retry logic from mask.py and
  create_star_cat.py (server fallback, timeout escalation, random stagger,
  double-precision coord fix inherited from mask.py).

* read_ext_sexcat_package/ + read_ext_sexcat_runner.py — restored from
  the deleted read_ext_cat_package with:
  - Renamed symbols (read_ext_sexcat, make_ldac_from_ascii, etc).
  - Tile-id overflow guard: the encoding tile_id = RRR*1000 + DDD silently
    collided for dec_idx >= 1000; extracted into _tile_id_from_file_number_string
    with an explicit bounds check.

* pyproject.toml: dropped dead deps — treecorr and Stile (removed in CosmoStat#715),
  sip_tpv (removed in CosmoStat#716, merged today). These were stale holdovers on
  v2.0 that never got swept after the corresponding develop cleanups.

* Dockerfile.jupyter: FROM points at ghcr.io/cosmostat/shapepipe:v2.0
  instead of the martinkilbinger fork tag, so upstream builds work.

Addresses the two new blockers flagged in PR CosmoStat#706 re-review:
CosmoStat#706 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cailmdaley
Copy link
Copy Markdown
Contributor

Pushed the fix as commit 63b8c107:

  • B1 — new src/shapepipe/utilities/vizier.py with query_vizier(ra, dec, radius_arcmin, cat_id), consolidating the retry logic from the two callers. Inherits the dtype='double' coord fix from mask.py.
  • B2 — restored read_ext_sexcat_package/read_ext_sexcat.py and read_ext_sexcat_runner.py with renamed symbols. Tile_id overflow guarded via a helper that asserts both grid components are in [0, 1000); raises ValueError with the actual values if not.
  • Minor cleanup bundled in — dropped treecorr, Stile, and sip_tpv from pyproject.toml (all three now dead post-refactor: remove rho-stats from shapepipe (closes #710) #715 and refactor: drop sip_tpv, rely on astropy for TPV WCS (closes #713) #716), and pointed Dockerfile.jupyter at ghcr.io/cosmostat/shapepipe:v2.0.

All four files compile and the tile_id bounds check has been smoke-tested against -301-279 (valid), -301-1000 (reject), -1000-279 (reject), --a-b (reject).

Ready for your final review / merge.

@martinkilbinger martinkilbinger merged commit 0b0923e into CosmoStat:develop Apr 24, 2026
@github-project-automation github-project-automation Bot moved this from To do to Done in ShapePipe Dev Apr 24, 2026
@martinkilbinger martinkilbinger deleted the v2.0 branch April 24, 2026 14:49
@cailmdaley cailmdaley mentioned this pull request Apr 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants